Training SpamAssassin with Active Semi-supervised Learning

Authors

  • Jun-Ming Xu
  • Giorgio Fumera
  • Fabio Roli
  • Zhi-Hua Zhou
Abstract

Most spam filters include automatic pattern classifiers based on machine learning and pattern recognition techniques. Such classifiers often require a large training set of labeled emails to discriminate well between spam and legitimate emails. In addition, they must be updated frequently, because spammers keep changing their emails to evade spam filters. Active learning and semi-supervised learning techniques can be used to address this issue. Many spam filters allow the user to give feedback on personal emails that were labeled automatically during filter operation, and some filters include a self-training mechanism to exploit the large number of unlabeled emails collected during filter operation. However, users are usually willing to label only a few emails, and the benefits of self-training techniques are limited. In this paper we propose an active semi-supervised learning method that better exploits unlabeled emails and can easily be implemented as a plug-in for real spam filters. Our method is based on clustering unlabeled emails, querying the label of one email per cluster, and propagating that label to the most similar emails in the same cluster. The effectiveness of our method is evaluated using the well-known open-source SpamAssassin filter on a large, publicly available corpus of real legitimate and spam emails.
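The three steps named in the abstract (cluster, query one email per cluster, propagate the label to the most similar emails) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which clustering algorithm, similarity measure, or propagation rule the paper uses, so the k-means-style clustering, the cosine similarity, the choice of the email closest to the cluster centroid as the query, and the `min_sim` threshold are all assumptions made for illustration. Emails are assumed to be already converted to numeric feature vectors, and `oracle` is a hypothetical callback standing in for the user who supplies a label.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster(points: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Assumed clustering step: k-means-style loop on cosine similarity."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = min(points, key=lambda p: max(cosine(p, c) for c in centroids))
        centroids.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its most similar centroid.
        assign = [max(range(k), key=lambda c: cosine(p, centroids[c]))
                  for p in points]
        # Recompute each centroid; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return assign


def label_by_cluster(points: List[Vector], k: int,
                     oracle: Callable[[int], str],
                     min_sim: float = 0.95) -> Dict[int, str]:
    """Query one email per cluster and propagate its label to near neighbours.

    Returns a dict from email index to label; emails that are not similar
    enough to the queried email of their cluster stay unlabeled.
    """
    assign = cluster(points, k)
    labels: Dict[int, str] = {}
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        centroid = mean([points[i] for i in members])
        # Assumed query rule: ask about the email closest to the centroid.
        query = max(members, key=lambda i: cosine(points[i], centroid))
        label = oracle(query)
        labels[query] = label
        # Propagate only to members that are similar enough to the query.
        for i in members:
            if cosine(points[i], points[query]) >= min_sim:
                labels[i] = label
    return labels
```

In a filter, the labeled subset returned by `label_by_cluster` would then be added to the training set, while emails left unlabeled are simply not used. A stricter `min_sim` trades coverage for fewer wrongly propagated labels.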

Related resources

A semi-supervised active learning algorithm for information extraction from textual data

In this article we present a semi-supervised active learning algorithm for pattern discovery in information extraction from textual data. The patterns are reduced regular expressions composed of various characteristics of features useful in information extraction. Our major contribution is a semi-supervised learning algorithm that extracts information from a set of examples labeled as relevant ...

Combining Committee-Based Semi-supervised and Active Learning and Its Application to Handwritten Digits Recognition

Semi-supervised learning reduces the cost of labeling the training data of a supervised learning algorithm by using unlabeled data together with labeled data to improve performance. Co-Training is a popular semi-supervised learning algorithm that requires multiple redundant and independent sets of features (views). In many real-world application domains, this requirement cannot be sa...

Active, semi-supervised learning to utilize human oracles

We present an approach to interactive machine learning, in which unlabeled data is employed in conjunction with active learning to better utilize the valuable resources that the human oracles provide. We empirically evaluate the approach in two very different applications, smartphone interruptibility prediction and semantic parsing. In both applications, we show that the use of active, semi-sup...

Active Deep Networks for Semi-Supervised Sentiment Classification

This paper presents a novel semi-supervised learning algorithm called Active Deep Networks (ADN) to address the semi-supervised sentiment classification problem with active learning. First, we propose the semi-supervised learning method of ADN. ADN is constructed from Restricted Boltzmann Machines (RBMs) with unsupervised learning, using labeled data and abundant unlabeled data. Then the constru...

Semi-supervised and Active Training of Conditional Random Fields for Activity Recognition

Automated human activity recognition has attracted increasing attention in the past decade. However, the application of machine learning and probabilistic methods for activity recognition problems has been studied only in the past couple of years. For the first time, this thesis explores the application of semi-supervised and active learning in activity recognition. We present a new and efficie...

Publication date: 2009